Start: 09 November 2017 (16:18:28)
nutri_orig <- read_xlsx("data/insa_tca.xlsx")
#glimpse(nutri_orig)
Problem: Variable names of the the original data and the risk of incorrectly tidying the data.
When we look at the nutri_long dataset we see that we could lose the ordering of the keyVars. We need to protect again this causing a problem during the data tidying. The problem stems from the original data, looking at the names of nutri_orig:
names(nutri_orig)[1:12]
## [1] "X__1" "Nome do alimento"
## [3] "Grupo" "Energia [kcal] (ENERCC)"
## [5] "X__2" "X__3"
## [7] "Energia [kJ] (ENERCJ)" "X__4"
## [9] "X__5" "Lípidos (FAT)"
## [11] "X__6" "X__7"
Variable Energia [kcal] (ENERCC) contains the value for nutrient “Energia [kcal] (ENERCC)”, X__2 the associated unit, and X__3 the associated quantity. It is the same for variable Lípidos (FAT) contains the value for nutrient “Lípidos (FAT)”, X__6 the associated unit, and X__7 the associated quantity. This is the same pattern for all the nutrients as it originates from the original Excel spreadsheet where each nutrients has 3 columns but one header in a merged cell above the 3 columns.
Using the column names of the the original data we will create an ordering variable. First we will remove the the first three variables as they will remain constants as columns.
The we will create a grouping variable for the nutrients value, unit and quantity called grpNtr and grpType.
ordNames <- as_tibble(list(varName = names(nutri_orig)[-c(1, 2, 3)])) %>%
mutate(grpNtr = ceiling(row_number()/3)) %>%
mutate(typeVars = parse_number(varName) %% 2) %>%
mutate(xVars = str_detect(varName, "^X__")) %>%
mutate(grpType = if_else(typeVars == 1, "Quantity", "Unit")) %>%
mutate(grpType = if_else(xVars == FALSE, "Nutrient", grpType)) %>%
select(-xVars, -typeVars)
## Warning: 37 parsing failures.
## row # A tibble: 5 x 4 col row col expected actual expected <int> <int> <chr> <chr> actual 1 1 NA a number Energia [kcal] (ENERCC) row 2 4 NA a number Energia [kJ] (ENERCJ) col 3 7 NA a number Lípidos (FAT) expected 4 10 NA a number Ácidos gordos saturados (FASAT) actual 5 13 NA a number Ácidos gordos monoinsaturados (FAMS)
## ... ................. ... ........................................................... ........ ........................................................... ...... ........................................................... ... ........................................................... ... ........................................................... ........ ........................................................... ...... ...........................................................
## See problems(...) for more details.
head(ordNames, n=12)
## # A tibble: 12 x 3
## varName grpNtr grpType
## <chr> <dbl> <chr>
## 1 Energia [kcal] (ENERCC) 1 Nutrient
## 2 X__2 1 Unit
## 3 X__3 1 Quantity
## 4 Energia [kJ] (ENERCJ) 2 Nutrient
## 5 X__4 2 Unit
## 6 X__5 2 Quantity
## 7 Lípidos (FAT) 3 Nutrient
## 8 X__6 3 Unit
## 9 X__7 3 Quantity
## 10 Ácidos gordos saturados (FASAT) 4 Nutrient
## 11 X__8 4 Unit
## 12 X__9 4 Quantity
From the nutri_orig data we will create a long format
Now we will merge the group ordering to the nutri_long dataset
nutri_long <- nutri_orig %>%
rename(foodID = X__1,
foodItem = `Nome do alimento`,
foodGroup = Grupo) %>%
gather(key = "keyVars", value = "keyVals", -starts_with("food")) %>%
select(foodID, foodGroup, foodItem, keyVars, keyVals) %>%
left_join(ordNames, by= c("keyVars" = "varName"))
glimpse(nutri_long)
## Observations: 136,407
## Variables: 7
## $ foodID <dbl> 619, 620, 802, 803, 703, 704, 646, 346, 345, 971, 97...
## $ foodGroup <chr> "Açúcar, confeitaria e sobremesas doces à base de ág...
## $ foodItem <chr> "\"Donut\"", "\"Donut\" recheado com doce de fruta",...
## $ keyVars <chr> "Energia [kcal] (ENERCC)", "Energia [kcal] (ENERCC)"...
## $ keyVals <chr> "400", "348", "878", "900", "114", "293", "11", "78"...
## $ grpNtr <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ grpType <chr> "Nutrient", "Nutrient", "Nutrient", "Nutrient", "Nut...
In the nutri_long dataset we have 136407 observations. Our final tidy dataset, which we call nutri_tidy should contain one third of these values (i.e., 45469).
# Prepare the Units and Quantity so that they can be merged to the Nutrient data.
nUnit <- nutri_long %>%
filter(grpType == "Unit") %>%
select(foodID, grpNtr, Unit = keyVals)
head(nUnit)
## # A tibble: 6 x 3
## foodID grpNtr Unit
## <dbl> <dbl> <chr>
## 1 619 1 quilocaloria
## 2 620 1 quilocaloria
## 3 802 1 quilocaloria
## 4 803 1 quilocaloria
## 5 703 1 quilocaloria
## 6 704 1 quilocaloria
nQty <- nutri_long %>%
filter(grpType == "Quantity") %>%
select(foodID, grpNtr, Quantity = keyVals)
head(nQty)
## # A tibble: 6 x 3
## foodID grpNtr Quantity
## <dbl> <dbl> <chr>
## 1 619 1 por 100 g de parte edível
## 2 620 1 por 100 g de parte edível
## 3 802 1 por 100 g de parte edível
## 4 803 1 por 100 g de parte edível
## 5 703 1 por 100 g de parte edível
## 6 704 1 por 100 g de parte edível
# Prepare the base for the Nutrition data by keeping only the nutrient and values
nutri_tidy <- nutri_long %>%
filter(grpType == "Nutrient") %>%
rename(Value = keyVals,
Nutrient = keyVars) %>%
left_join(nUnit, by = c("foodID", "grpNtr")) %>%
left_join(nQty, by = c("foodID", "grpNtr")) %>%
select(-grpNtr, -grpType) %>%
arrange(foodID, Nutrient )
glimpse(nutri_tidy)
## Observations: 45,469
## Variables: 7
## $ foodID <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ foodGroup <chr> "Leite e produtos lácteos", "Leite e produtos lácteo...
## $ foodItem <chr> "Leite de cabra cru", "Leite de cabra cru", "Leite d...
## $ Nutrient <chr> "a-tocoferol (TOCPHA)", "Ácido linoleico (F18:2CN6)"...
## $ Value <chr> "0.03", "0.1", "1.1", "0.1", "2.6", "0.1", "0", "86....
## $ Unit <chr> "milligrama", "grama", "grama", "grama", "grama", "g...
## $ Quantity <chr> "por 100 g de parte edível", "por 100 g de parte edí...
Data is tidy but not totally cleaned yet
nutri_tidy %>% arrange(foodGroup, foodItem) %>% select(foodItem)
## # A tibble: 45,469 x 1
## foodItem
## <chr>
## 1 "\"Donut\""
## 2 "\"Donut\""
## 3 "\"Donut\""
## 4 "\"Donut\""
## 5 "\"Donut\""
## 6 "\"Donut\""
## 7 "\"Donut\""
## 8 "\"Donut\""
## 9 "\"Donut\""
## 10 "\"Donut\""
## # ... with 45,459 more rows
We use stringr to clean up the values from foodItem where we can still find quotation marks (“”)
nutri_clean <- nutri_tidy %>%
# clean observations
mutate(foodItem = str_replace_all(
foodItem, "\\p{quotation mark}", ""),
Nutrient = str_replace_all(
Nutrient, "\\+", "")
) %>%
arrange(foodGroup, foodItem)
DT::datatable(nutri_clean)
## Warning in instance$preRenderHook(instance): It seems your data is too
## big for client-side DataTables. You may consider server-side processing:
## http://rstudio.github.io/DT/server.html
Now when we search for “Donut” on the search window we no longer see it with quotation marks.
When we glimpse() the data we still see that the variable types are not accordingly
glimpse(nutri_clean)
## Observations: 45,469
## Variables: 7
## $ foodID <dbl> 804, 804, 804, 804, 804, 804, 804, 804, 804, 804, 80...
## $ foodGroup <chr> "Açúcar, confeitaria e sobremesas doces à base de ág...
## $ foodItem <chr> "Açúcar amarelo", "Açúcar amarelo", "Açúcar amarelo"...
## $ Nutrient <chr> "a-tocoferol (TOCPHA)", "Ácido linoleico (F18:2CN6)"...
## $ Value <chr> "0", "0", "0", "0", "0", "0", "0", "2", "0", "0", "4...
## $ Unit <chr> "milligrama", "grama", "grama", "grama", "grama", "g...
## $ Quantity <chr> "por 100 g de parte edível", "por 100 g de parte edí...
Convert variable types to their actual type
nutri_clean <- readr::type_convert(nutri_clean)
## Parsed with column specification:
## cols(
## foodGroup = col_character(),
## foodItem = col_character(),
## Nutrient = col_character(),
## Value = col_double(),
## Unit = col_character(),
## Quantity = col_character()
## )
glimpse(nutri_clean)
## Observations: 45,469
## Variables: 7
## $ foodID <dbl> 804, 804, 804, 804, 804, 804, 804, 804, 804, 804, 80...
## $ foodGroup <chr> "Açúcar, confeitaria e sobremesas doces à base de ág...
## $ foodItem <chr> "Açúcar amarelo", "Açúcar amarelo", "Açúcar amarelo"...
## $ Nutrient <chr> "a-tocoferol (TOCPHA)", "Ácido linoleico (F18:2CN6)"...
## $ Value <dbl> 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 0.00, 2.00, 0.00...
## $ Unit <chr> "milligrama", "grama", "grama", "grama", "grama", "g...
## $ Quantity <chr> "por 100 g de parte edível", "por 100 g de parte edí...
End: 09 November 2017 (16:18:33)